To reduce the fluctuation caused by random sampling, especially in the bootstrap phase, N-step rewards (discounted sums of rewards) are useful. By expanding the Bellman equation, the N-step target of the Q function becomes \(\sum _{k=0}^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _{a} Q(s_{t+N},a)\).
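
As a worked instance of the formula above, the following minimal sketch evaluates the N-step target for a single transition in plain Python. The function name `nstep_target` and the example numbers are illustrative assumptions and are not part of cpprb.

```python
def nstep_target(rewards, gamma, q_next):
    """Return sum_k gamma^k * r_{t+k} + gamma^N * max_a Q(s_{t+N}, a).

    rewards: the N rewards r_t, ..., r_{t+N-1}
    q_next:  the Q values of every action at s_{t+N}
    """
    n = len(rewards)
    discounted_sum = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_sum + gamma ** n * max(q_next)


# Four rewards of 1.0 and three hypothetical action values at s_{t+4}.
target = nstep_target([1.0, 1.0, 1.0, 1.0], gamma=0.99, q_next=[0.5, 2.0, 1.0])
print(target)
```

With N = 1 the function reduces to the ordinary one-step target \(r_t + \gamma \max _{a} Q(s_{t+1},a)\).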

According to W. Fedus et al., N-step rewards make more effective use of a larger buffer. Although the N-step reward depends on the policy used during exploration and is therefore not theoretically justified for off-policy learning, it still works better.

You can create an N-step replay buffer by specifying the `Nstep` parameter in the constructor of `ReplayBuffer` or `PrioritizedReplayBuffer`. Without any modification to the environment, cpprb sums up the N-step rewards and shifts “next” values such as `next_obs`.

The `Nstep` parameter is a `dict` with the keys `"size"`, `"rew"`, `"gamma"`, and `"next"`. `Nstep["size"]` is the N-step size; a size of 1 is identical to an ordinary replay buffer (but inefficient). `Nstep["rew"]`, whose type is `str` or array-like of `str`, specifies the (set of) reward(s). `Nstep["gamma"]` is the discount factor for the reward summation. `Nstep["next"]`, whose type is `str` or array-like of `str`, specifies the (set of) next-type value(s); for these, the `sample` method returns the (i+N)-th value instead of the (i+1)-th one.

`sample` also replaces `"done"` with its N-step version.

cpprb v10 no longer returns \(\gamma ^{N-1}\), since users can always multiply by the fixed \(\gamma ^N\) themselves.

Since the N-step buffer temporarily stores values in local storage, you need to call the `on_episode_end` member function at the end of every episode to flush them into the main storage properly.
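
To illustrate why the flush matters, here is a toy model of N-step local storage in plain Python. It is a sketch under simplifying assumptions and not cpprb's actual implementation: the class `ToyNstepAccumulator` and its attributes `pending` and `main` are hypothetical names used only for this illustration.

```python
from collections import deque


class ToyNstepAccumulator:
    """Toy model of N-step local storage (NOT cpprb's implementation)."""

    def __init__(self, size, gamma):
        self.size = size
        self.gamma = gamma
        self.pending = deque()  # local storage: transitions not yet emitted
        self.main = []          # stands in for the main replay storage

    def _emit_oldest(self):
        # Summarize the pending window, starting at its oldest transition.
        window = list(self.pending)
        self.main.append({
            "obs": window[0]["obs"],
            "rew": sum(self.gamma ** k * t["rew"] for k, t in enumerate(window)),
            "next_obs": window[-1]["next_obs"],
            "done": max(t["done"] for t in window),
        })
        self.pending.popleft()

    def add(self, obs, rew, next_obs, done):
        self.pending.append(
            {"obs": obs, "rew": rew, "next_obs": next_obs, "done": done})
        if len(self.pending) == self.size:
            self._emit_oldest()

    def on_episode_end(self):
        # Flush the remainder, using shorter and shorter windows.
        while self.pending:
            self._emit_oldest()


acc = ToyNstepAccumulator(size=4, gamma=0.5)
for t in range(10):  # one episode of 10 steps
    acc.add(obs=t, rew=1.0, next_obs=t + 1, done=float(t == 9))
print(len(acc.main))  # 7
acc.on_episode_end()
print(len(acc.main))  # 10
```

In this toy model the last N-1 transitions of an episode stay in local storage until the flush, which is the situation `on_episode_end` is meant to resolve.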

| Parameters | Type | Description |
|---|---|---|
| `size` | `int` | N-step size |
| `rew` | `str` or array-like of `str` | N-step reward |
| `gamma` | `float` | Discount factor |
| `next` | `str` or array-like of `str` | Next items (e.g. `next_obs`) |
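
Because `rew` and `next` accept either a single `str` or an array-like of `str`, a configuration with several rewards and several next items might look like the sketch below. The item names `rew_extrinsic`, `rew_intrinsic`, `state`, and `next_state` are hypothetical, and the final consistency check rests on the assumption that every listed name has to be a key of your own environment dictionary.

```python
# Hypothetical item names, for illustration only.
env_dict = {
    "obs": {"shape": (4, 4)},
    "state": {"shape": 2},
    "act": {"shape": 3},
    "rew_extrinsic": {},
    "rew_intrinsic": {},
    "next_obs": {"shape": (4, 4)},
    "next_state": {"shape": 2},
    "done": {},
}

nstep_config = {
    "size": 4,                                   # N-step size
    "gamma": 0.99,                               # discount factor
    "rew": ["rew_extrinsic", "rew_intrinsic"],   # rewards to be summed
    "next": ["next_obs", "next_state"],          # items shifted to t+N
}

# Assumption: every name under "rew" or "next" should exist in env_dict.
missing = [name
           for name in nstep_config["rew"] + nstep_config["next"]
           if name not in env_dict]
print(missing)  # []
```

These two dictionaries would then be passed as the environment dictionary and the `Nstep` argument of the `ReplayBuffer` constructor.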

| Return Value | Replace: From \(\to\) To |
|---|---|
| Next items (e.g. `next_obs`) | \(s_{t+1} \to s_{t+N}\) |
| `rew` | \(r_t \to \sum _{n=0}^{N-1} \gamma ^n r_{t+n}\) |
| `done` | \(d_t \to 1-\prod _{n=0}^{N-1} (1-d_{t+n})\) |
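
The `done` replacement amounts to a logical OR over the window: the result is 1 as soon as any step in the window is terminal. The short sketch below evaluates the formula directly; the helper name `nstep_done` is an illustrative assumption and not a cpprb function.

```python
from math import prod


def nstep_done(dones):
    """1 - prod_n (1 - d_{t+n}): 1.0 if any step in the window is terminal."""
    return 1.0 - prod(1.0 - d for d in dones)


print(nstep_done([0.0, 0.0, 0.0, 0.0]))  # 0.0
print(nstep_done([0.0, 0.0, 0.0, 1.0]))  # 1.0
```

When the N-step `done` is 1, a factor of `(1 - done)` in the target removes the bootstrap term \(\gamma ^N \max _{a} Q(s_{t+N},a)\).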

```
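import numpy as np
from cpprb import ReplayBuffer

nstep = 4
gamma = 0.99
discounts = gamma ** nstep  # gamma^N, applied to the bootstrap term

rb = ReplayBuffer(32,
                  {'obs': {"shape": (4,4)},
                   'act': {"shape": 3},
                   'rew': {},
                   'next_obs': {"shape": (4,4)},
                   'done': {}},
                  Nstep={"size": nstep,
                         "gamma": gamma,
                         "rew": "rew",
                         "next": "next_obs"})

def Q(obs):
    # Placeholder for your own Q function: one value per action (3 here).
    return np.zeros((obs.shape[0], 3))

for i in range(100):
    done = 1.0 if i%10 == 9 else 0.0  # every 10th step ends an episode
    rb.add(obs=np.zeros((4,4)),
           act=np.ones((3)),
           rew=1.0,
           next_obs=np.zeros((4,4)),
           done=done)
    if done:
        rb.on_episode_end()  # flush the N-step local storage

sample = rb.sample(16)
# Flatten "rew" and "done" so they align with the per-sample max over actions.
nstep_target = (sample["rew"].ravel()
                + (1-sample["done"].ravel()) * discounts * Q(sample["next_obs"]).max(axis=1))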
```

This N-step feature assumes that the sequential transitions of a trajectory (episode) are stored sequentially. If you use a distributed agent configuration, you must add each episode as a single contiguous block, so that transitions from different episodes are not interleaved.
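
One possible way to satisfy this requirement is to collect each actor's transitions locally and write a whole episode only once it has finished. The sketch below is an assumption-laden illustration, not a cpprb facility: `EpisodeWriter` and `RecordingBuffer` are hypothetical names, the shared lock is a design choice of this sketch, and the only buffer methods it relies on are `add` and `on_episode_end`.

```python
import threading


class EpisodeWriter:
    """Collects one actor's transitions and writes whole episodes at once."""

    def __init__(self, buffer, lock):
        self.buffer = buffer  # any object with add(**kw) and on_episode_end()
        self.lock = lock      # shared by all actors writing to `buffer`
        self.episode = []     # transitions of the current episode

    def add(self, **transition):
        self.episode.append(transition)
        if transition["done"]:
            self.flush()

    def flush(self):
        # Hold the lock so that no other actor interleaves its transitions.
        with self.lock:
            for transition in self.episode:
                self.buffer.add(**transition)
            self.buffer.on_episode_end()
        self.episode = []


class RecordingBuffer:
    """Stand-in for a replay buffer that only records the call order."""

    def __init__(self):
        self.log = []

    def add(self, **transition):
        self.log.append(transition["obs"])

    def on_episode_end(self):
        self.log.append("END")


recorder = RecordingBuffer()
lock = threading.Lock()
actor_a = EpisodeWriter(recorder, lock)
actor_b = EpisodeWriter(recorder, lock)
for t in range(3):  # the two actors step in an interleaved fashion
    actor_a.add(obs=f"a{t}", done=(t == 2))
    actor_b.add(obs=f"b{t}", done=(t == 2))
print(recorder.log)  # ['a0', 'a1', 'a2', 'END', 'b0', 'b1', 'b2', 'END']
```

Although the actors step alternately, each episode reaches the buffer as one uninterrupted run followed by its own `on_episode_end` call.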